Python: Data Management

How to serialize and de-serialize data in Python. There are many options; we cover the ones that are currently popular and important in data science.

Pickle

This is probably the default serialization method used by most Python developers. Its main disadvantage is that it is Python-specific, and cannot be easily loaded in other languages. However, it is convenient if your project is Python only.

In [1]:
import pickle
In [2]:
bulbasaur = {
    "id": 1,
    "name": "bulbasaur",
    "base_experience": 64,
    "height": 7,
    "is_default": True,
    "order": 1,
    "weight": 69,
    "abilities": [
        {
            "is_hidden": True,
            "slot": 3,
            "ability": {
                "name": "chlorophyll",
                "url": "http://pokeapi.co/api/v2/ability/34/"
            }
        }
    ]
}

Pickle protocols

Objects can be pickled using several protocols (0 through 4 in Python 3.4–3.7; protocol 5 was added in Python 3.8). In general, use pickle.HIGHEST_PROTOCOL, as it is the most flexible and supports very large objects, unless you need to share data with Python 2, in which case use protocol 2.
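
As a quick sketch (exact byte counts vary by Python version and object), the same object can be serialized under each available protocol and the resulting sizes compared; pickle.loads detects the protocol automatically regardless of which one was used to write:

```python
import pickle

obj = {"name": "bulbasaur", "id": 1, "abilities": ["overgrow", "chlorophyll"]}

# Serialize under every protocol this interpreter supports (0 .. HIGHEST_PROTOCOL).
for protocol in range(pickle.HIGHEST_PROTOCOL + 1):
    size = len(pickle.dumps(obj, protocol))
    print(f"protocol {protocol}: {size} bytes")

# No matter which protocol wrote the bytes, loads figures it out on its own.
assert pickle.loads(pickle.dumps(obj, 0)) == obj
```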

Serialize

In [3]:
with open('data/bulbasaur.pickle', 'wb') as f:
    pickle.dump(bulbasaur, f, pickle.HIGHEST_PROTOCOL)

De-serialize

In [4]:
with open('data/bulbasaur.pickle', 'rb') as f:
    pokemon = pickle.load(f)
In [5]:
pokemon
Out[5]:
{'abilities': [{'ability': {'name': 'chlorophyll',
    'url': 'http://pokeapi.co/api/v2/ability/34/'},
   'is_hidden': True,
   'slot': 3}],
 'base_experience': 64,
 'height': 7,
 'id': 1,
 'is_default': True,
 'name': 'bulbasaur',
 'order': 1,
 'weight': 69}

Serialize to byte string

In [6]:
s = pickle.dumps(bulbasaur, pickle.HIGHEST_PROTOCOL)
In [7]:
s
Out[7]:
b'\x80\x04\x95\xd5\x00\x00\x00\x00\x00\x00\x00}\x94(\x8c\x02id\x94K\x01\x8c\x04name\x94\x8c\tbulbasaur\x94\x8c\x0fbase_experience\x94K@\x8c\x06height\x94K\x07\x8c\nis_default\x94\x88\x8c\x05order\x94K\x01\x8c\x06weight\x94KE\x8c\tabilities\x94]\x94}\x94(\x8c\tis_hidden\x94\x88\x8c\x04slot\x94K\x03\x8c\x07ability\x94}\x94(h\x02\x8c\x0bchlorophyll\x94\x8c\x03url\x94\x8c$http://pokeapi.co/api/v2/ability/34/\x94uuau.'

De-serialize from byte string

In [8]:
pokemon2 = pickle.loads(s)
In [9]:
pokemon2
Out[9]:
{'abilities': [{'ability': {'name': 'chlorophyll',
    'url': 'http://pokeapi.co/api/v2/ability/34/'},
   'is_hidden': True,
   'slot': 3}],
 'base_experience': 64,
 'height': 7,
 'id': 1,
 'is_default': True,
 'name': 'bulbasaur',
 'order': 1,
 'weight': 69}
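
Pickle is not limited to built-in containers: instances of user-defined classes round-trip as well, provided the class definition is importable at load time. A minimal sketch with a hypothetical Pokemon class:

```python
import pickle

class Pokemon:
    """Hypothetical class for illustration; any importable class works."""
    def __init__(self, name, weight):
        self.name = name
        self.weight = weight

    def __eq__(self, other):
        return (self.name, self.weight) == (other.name, other.weight)

p = Pokemon("bulbasaur", 69)
restored = pickle.loads(pickle.dumps(p, pickle.HIGHEST_PROTOCOL))
assert restored == p
```

Note that pickle records only the class's module and qualified name, not its code, so the same class must be importable when de-serializing. Also, never unpickle data from an untrusted source: loading a pickle can execute arbitrary code.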

Feather

Feather is a highly optimized binary serialization format for columnar tabular data that is useful for loading and saving large data frames. It can also be used to share large data frames among Python, R, and Julia.

Installation in Python

pip3 install feather-format

Installation in R

install.packages("feather")
In [10]:
from pandas_datareader import data
import arrow
import feather

Download data from Google Finance

In [11]:
start = arrow.get('2010-01-01')
end = arrow.get('2016-12-31')
tickers = ['AAPL', 'MSFT', 'SPY']
data_source = 'google'
panel = data.DataReader(tickers, data_source, start.datetime, end.datetime)
In [12]:
panel.keys()
Out[12]:
Index(['Open', 'High', 'Low', 'Close', 'Volume'], dtype='object')

Format closing prices

In [13]:
close = panel.loc['Close']
close = close.reset_index()
close.head()
Out[13]:
Date AAPL MSFT SPY
0 2010-01-04 30.57 30.95 113.33
1 2010-01-05 30.63 30.96 113.63
2 2010-01-06 30.14 30.77 113.71
3 2010-01-07 30.08 30.45 114.19
4 2010-01-08 30.28 30.66 114.57

Serialize

In [14]:
feather.write_dataframe(close, 'data/close.feather')

De-serialize

In [15]:
close2 = feather.read_dataframe('data/close.feather')
close2.head()
Out[15]:
Date AAPL MSFT SPY
0 2010-01-04 30.57 30.95 113.33
1 2010-01-05 30.63 30.96 113.63
2 2010-01-06 30.14 30.77 113.71
3 2010-01-07 30.08 30.45 114.19
4 2010-01-08 30.28 30.66 114.57

Sharing data frames between R and Python

In [16]:
%load_ext rpy2.ipython
In [17]:
%%R

library(feather)
close <- read_feather('data/close.feather')
head(close)
# A tibble: 6 x 4
                 Date  AAPL  MSFT    SPY
               <dttm> <dbl> <dbl>  <dbl>
1 2010-01-03 19:00:00 30.57 30.95 113.33
2 2010-01-04 19:00:00 30.63 30.96 113.63
3 2010-01-05 19:00:00 30.14 30.77 113.71
4 2010-01-06 19:00:00 30.08 30.45 114.19
5 2010-01-07 19:00:00 30.28 30.66 114.57
6 2010-01-10 19:00:00 30.02 30.27 114.73

In [18]:
%%R

write_feather(close, 'data/closeR.feather')
In [19]:
close3 = feather.read_dataframe('data/closeR.feather')
close3.head()
Out[19]:
Date AAPL MSFT SPY
0 2010-01-04 30.57 30.95 113.33
1 2010-01-05 30.63 30.96 113.63
2 2010-01-06 30.14 30.77 113.71
3 2010-01-07 30.08 30.45 114.19
4 2010-01-08 30.28 30.66 114.57

HDF5

HDF5 is used to store and retrieve large quantities of hierarchically organized numeric data. It can be annotated to be “self-documenting” and makes efficient use of memory. It is a standard format, so can be transferred across language platforms.

We show a simple example of saving simulation data.

Saving to disk

In [20]:
import h5py
import arrow
import numpy as np
In [21]:
with h5py.File('data/sim.hdf5', 'w') as root:
    root.attrs['author'] = 'Cliburn'
    root.attrs['description'] = 'Simulated data sets for BIOS-821 demo'

    for i in range(4):
        g = root.create_group(f'expt{i:03}')
        g.attrs['created'] = arrow.utcnow().format()
        for j in range(5):
            d = g.create_dataset(f'sim{j:03}', data=np.random.random((1000,1000)))
            d.attrs['modified'] = arrow.utcnow().format()
In [22]:
root = h5py.File('data/sim.hdf5', 'r')
In [23]:
list(root.attrs.keys())
Out[23]:
['author', 'description']
In [24]:
for k, v in root.attrs.items():
    print(f'{k:12}: {v}')
author      : Cliburn
description : Simulated data sets for BIOS-821 demo
In [25]:
def show(name):
    print(root[name])
In [26]:
root.visit(show)
<HDF5 group "/expt000" (5 members)>
<HDF5 dataset "sim000": shape (1000, 1000), type "<f8">
<HDF5 dataset "sim001": shape (1000, 1000), type "<f8">
<HDF5 dataset "sim002": shape (1000, 1000), type "<f8">
<HDF5 dataset "sim003": shape (1000, 1000), type "<f8">
<HDF5 dataset "sim004": shape (1000, 1000), type "<f8">
<HDF5 group "/expt001" (5 members)>
<HDF5 dataset "sim000": shape (1000, 1000), type "<f8">
<HDF5 dataset "sim001": shape (1000, 1000), type "<f8">
<HDF5 dataset "sim002": shape (1000, 1000), type "<f8">
<HDF5 dataset "sim003": shape (1000, 1000), type "<f8">
<HDF5 dataset "sim004": shape (1000, 1000), type "<f8">
<HDF5 group "/expt002" (5 members)>
<HDF5 dataset "sim000": shape (1000, 1000), type "<f8">
<HDF5 dataset "sim001": shape (1000, 1000), type "<f8">
<HDF5 dataset "sim002": shape (1000, 1000), type "<f8">
<HDF5 dataset "sim003": shape (1000, 1000), type "<f8">
<HDF5 dataset "sim004": shape (1000, 1000), type "<f8">
<HDF5 group "/expt003" (5 members)>
<HDF5 dataset "sim000": shape (1000, 1000), type "<f8">
<HDF5 dataset "sim001": shape (1000, 1000), type "<f8">
<HDF5 dataset "sim002": shape (1000, 1000), type "<f8">
<HDF5 dataset "sim003": shape (1000, 1000), type "<f8">
<HDF5 dataset "sim004": shape (1000, 1000), type "<f8">

Access a specific data set (h5py reads lazily, so data is only pulled from disk when sliced)

In [27]:
data = root['expt002/sim003']
In [28]:
data.shape
Out[28]:
(1000, 1000)
In [29]:
list(data.attrs.keys())
Out[29]:
['modified']
In [30]:
data.attrs['modified']
Out[30]:
'2017-07-02 18:15:24+00:00'

Close the file handle

In [31]:
root.close()

JSON

JSON (JavaScript Object Notation) is a very simple plain text standard that is now ubiquitously used for data transfer, especially over the web via Representational State Transfer (REST) APIs.

JSON consists of objects (which resemble Python dictionaries, except that all keys must be strings) and arrays (which resemble Python lists), with values drawn from

  • strings - must be double quoted
  • numbers - integers or floats
  • boolean - true or false
  • null

Objects and arrays can be nested to arbitrary levels.

The json package in the standard library will convert JSON objects to Python dictionaries and JSON arrays to Python lists, and vice-versa.
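
The type mapping can be seen in a small round trip: Python's True becomes JSON's true, None becomes null, and the nested list survives intact.

```python
import json

d = {"id": 1, "is_default": True, "evolution": None, "stats": [45, 49]}

s = json.dumps(d)  # Python dict -> JSON text
print(s)           # {"id": 1, "is_default": true, "evolution": null, "stats": [45, 49]}

assert json.loads(s) == d  # JSON text -> Python dict, round trip is lossless here
```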

A JSON Object

Note that there must NOT be a comma after the last key: value pair. This is unlike Python dictionaries where trailing commas are OK.

{
    "class": "Cleric",
    "power source": "Divine",
    "role": "Leader"
}

A JSON Array

Here we show an array of objects, each of which has 3 key-value pairs.

[
    {
        "class": "Cleric",
        "power": "Divine",
        "role": "Leader"
    },
    {
        "class": "Fighter",
        "power": "Martial",
        "role": "Defender"
    },
    {
        "class": "Paladin",
        "power": "Divine",
        "role": "Defender"
    },
    {
        "class": "Ranger",
        "power": "Martial",
        "role": "Striker"
    },
    {
        "class": "Rogue",
        "power": "Martial",
        "role": "Striker"
    },
    {
        "class": "Warlock",
        "power": "Arcane",
        "role": "Striker"
    },
    {
        "class": "Warlord",
        "power": "Martial",
        "role": "Leader"
    },
    {
        "class": "Wizard",
        "power": "Arcane",
        "role": "Controller"
    }
]
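
Parsed with json.loads, an array like the one above becomes a plain list of dictionaries that can be filtered like any other Python data. A sketch using a shortened version of the array:

```python
import json

text = '''[
    {"class": "Cleric",  "power": "Divine",  "role": "Leader"},
    {"class": "Fighter", "power": "Martial", "role": "Defender"},
    {"class": "Rogue",   "power": "Martial", "role": "Striker"}
]'''

classes = json.loads(text)  # JSON array -> list of dicts
strikers = [c["class"] for c in classes if c["role"] == "Striker"]
print(strikers)  # ['Rogue']
```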

JSON and Python

In [32]:
import requests

JSON from REST API

JSON is particularly important for transferring data over the web. More and more databases now allow you to access their data in a granular fashion using REST API calls. We illustrate with retrieval of information about berries that might be used to feed your Pokemon.

Berries are small fruits that can provide HP and status condition restoration, stat enhancement, and even damage negation when eaten by Pokémon. Check out Bulbapedia for greater detail. (Description from PokeAPI.)

In [33]:
response = requests.get('http://pokeapi.co/api/v2/berry/1/')
In [34]:
data = response.json()
In [35]:
data
Out[35]:
{'firmness': {'name': 'soft',
  'url': 'http://pokeapi.co/api/v2/berry-firmness/2/'},
 'flavors': [{'flavor': {'name': 'spicy',
    'url': 'http://pokeapi.co/api/v2/berry-flavor/1/'},
   'potency': 10},
  {'flavor': {'name': 'dry',
    'url': 'http://pokeapi.co/api/v2/berry-flavor/2/'},
   'potency': 0},
  {'flavor': {'name': 'sweet',
    'url': 'http://pokeapi.co/api/v2/berry-flavor/3/'},
   'potency': 0},
  {'flavor': {'name': 'bitter',
    'url': 'http://pokeapi.co/api/v2/berry-flavor/4/'},
   'potency': 0},
  {'flavor': {'name': 'sour',
    'url': 'http://pokeapi.co/api/v2/berry-flavor/5/'},
   'potency': 0}],
 'growth_time': 3,
 'id': 1,
 'item': {'name': 'cheri-berry', 'url': 'http://pokeapi.co/api/v2/item/126/'},
 'max_harvest': 5,
 'name': 'cheri',
 'natural_gift_power': 60,
 'natural_gift_type': {'name': 'fire',
  'url': 'http://pokeapi.co/api/v2/type/10/'},
 'size': 20,
 'smoothness': 25,
 'soil_dryness': 15}

Serializing to disk

In [36]:
import json
In [37]:
filename = 'data/{}.json'.format(data['name'])
filename
Out[37]:
'data/cheri.json'
In [38]:
with open(filename, 'w') as f:
    json.dump(data, f)

De-serializing

In [39]:
with open(filename) as f:
    cheri = json.load(f)
cheri
Out[39]:
{'firmness': {'name': 'soft',
  'url': 'http://pokeapi.co/api/v2/berry-firmness/2/'},
 'flavors': [{'flavor': {'name': 'spicy',
    'url': 'http://pokeapi.co/api/v2/berry-flavor/1/'},
   'potency': 10},
  {'flavor': {'name': 'dry',
    'url': 'http://pokeapi.co/api/v2/berry-flavor/2/'},
   'potency': 0},
  {'flavor': {'name': 'sweet',
    'url': 'http://pokeapi.co/api/v2/berry-flavor/3/'},
   'potency': 0},
  {'flavor': {'name': 'bitter',
    'url': 'http://pokeapi.co/api/v2/berry-flavor/4/'},
   'potency': 0},
  {'flavor': {'name': 'sour',
    'url': 'http://pokeapi.co/api/v2/berry-flavor/5/'},
   'potency': 0}],
 'growth_time': 3,
 'id': 1,
 'item': {'name': 'cheri-berry', 'url': 'http://pokeapi.co/api/v2/item/126/'},
 'max_harvest': 5,
 'name': 'cheri',
 'natural_gift_power': 60,
 'natural_gift_type': {'name': 'fire',
  'url': 'http://pokeapi.co/api/v2/type/10/'},
 'size': 20,
 'smoothness': 25,
 'soil_dryness': 15}

Relational Database (SQLite3)

SQLite3 is part of the standard library. However, the mechanics of using essentially any relational database from Python are similar, thanks to the Python DB-API.
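
The DB-API pattern — connect, execute, fetch, commit, close — can be sketched with a throwaway in-memory database (the table and rows here are made up for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # in-memory database, discarded on close
cur = conn.cursor()
cur.execute("CREATE TABLE pokemon (id INTEGER PRIMARY KEY, name TEXT)")
cur.executemany("INSERT INTO pokemon VALUES (?, ?)",
                [(1, "bulbasaur"), (4, "charmander"), (7, "squirtle")])
conn.commit()

rows = cur.execute("SELECT name FROM pokemon ORDER BY id").fetchall()
print(rows)  # [('bulbasaur',), ('charmander',), ('squirtle',)]
conn.close()
```

The same connect/cursor/execute calls work against PostgreSQL, MySQL, Oracle, and so on, via their respective DB-API drivers; mostly only the connect call changes.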

In [40]:
import sqlite3
In [41]:
c = sqlite3.connect('data/Chinook_Sqlite.sqlite')

SQLite specific commands to get metadata

Unlike SQL syntax for queries, how you get metadata from a relational database is vendor-specific. You’ll have to read the docs to find out what is needed for your SQL flavor.

What tables are there in the database?

In [42]:
list(c.execute("SELECT name FROM sqlite_master WHERE type='table';"))
Out[42]:
[('Album',),
 ('Artist',),
 ('Customer',),
 ('Employee',),
 ('Genre',),
 ('Invoice',),
 ('InvoiceLine',),
 ('MediaType',),
 ('Playlist',),
 ('PlaylistTrack',),
 ('Track',)]

What are the columns of the table “Album”?

In [43]:
list(c.execute("PRAGMA table_info(Album);"))
Out[43]:
[(0, 'AlbumId', 'INTEGER', 1, None, 1),
 (1, 'Title', 'NVARCHAR(160)', 1, None, 0),
 (2, 'ArtistId', 'INTEGER', 1, None, 0)]

Standard SQL statements with parameter substitution

Note: Interpolating Python values into SQL with string substitution is dangerous because of the risk of SQL injection attacks. Use DB-API parameter substitution with ? instead.

Do this

In [44]:
t = ['%rock%']
list(c.execute("SELECT * FROM Album WHERE Title like ? LIMIT 5;", t))
Out[44]:
[(1, 'For Those About To Rock We Salute You', 1),
 (4, 'Let There Be Rock', 1),
 (59, 'Deep Purple In Rock', 58),
 (108, 'Rock In Rio [CD1]', 90),
 (109, 'Rock In Rio [CD2]', 90)]

Not this

In [45]:
t = ["'%rock%'"]
list(c.execute("SELECT * FROM Album WHERE Title like %s LIMIT 5;" % t[0]))
Out[45]:
[(1, 'For Those About To Rock We Salute You', 1),
 (4, 'Let There Be Rock', 1),
 (59, 'Deep Purple In Rock', 58),
 (108, 'Rock In Rio [CD1]', 90),
 (109, 'Rock In Rio [CD2]', 90)]
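
To see why this matters, here is a sketch (on a made-up in-memory table) of how a crafted input rewrites a %-formatted query, while the ?-bound version treats the same input as an ordinary value:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, secret TEXT)")
conn.execute("INSERT INTO users VALUES ('alice', 'hunter2')")

malicious = "nobody' OR '1'='1"

# String formatting: the quotes in the input change the query's logic,
# turning it into ... WHERE name = 'nobody' OR '1'='1'
leaked = conn.execute(
    "SELECT * FROM users WHERE name = '%s'" % malicious).fetchall()
print(leaked)  # [('alice', 'hunter2')] -- every row leaks

# Parameter binding: the input is just data, and matches nothing.
safe = conn.execute(
    "SELECT * FROM users WHERE name = ?", (malicious,)).fetchall()
print(safe)    # []
```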

Using SQL magic functions

We will use the ipython-sql notebook extension for convenience. This will only work in notebooks and IPython scripts with the .ipy extension.

In [46]:
import warnings

with warnings.catch_warnings():
    warnings.simplefilter('ignore')
    %load_ext sql

Configuring the SqlMagic extension

In [47]:
%config SqlMagic
SqlMagic options
--------------
SqlMagic.autolimit=<Int>
    Current: 0
    Automatically limit the size of the returned result sets
SqlMagic.autopandas=<Bool>
    Current: False
    Return Pandas DataFrames instead of regular result sets
SqlMagic.column_local_vars=<Bool>
    Current: False
    Return data into local variables from column names
SqlMagic.displaylimit=<Int>
    Current: 0
    Automatically limit the number of rows displayed (full result set is still
    stored)
SqlMagic.dsn_filename=<Unicode>
    Current: 'odbc.ini'
    Path to DSN file. When the first argument is of the form [section], a
    sqlalchemy connection string is formed from the matching section in the DSN
    file.
SqlMagic.feedback=<Bool>
    Current: True
    Print number of rows affected by DML
SqlMagic.short_errors=<Bool>
    Current: True
    Don't display the full traceback on SQL Programming Error
SqlMagic.style=<Unicode>
    Current: 'DEFAULT'
    Set the table printing style to any of prettytable's defined styles
    (currently DEFAULT, MSWORD_FRIENDLY, PLAIN_COLUMNS, RANDOM)
In [48]:
%config SqlMagic.displaylimit=10

Connect to SQLite3 database

See SQLAlchemy connection strings for how to connect to other databases such as Oracle, MySQL or PostgreSQL.

In [49]:
%sql sqlite:///data/Chinook_Sqlite.sqlite
Out[49]:
'Connected: None@data/Chinook_Sqlite.sqlite'
In [50]:
%sql SELECT * from Album LIMIT 5;
Done.
Out[50]:
AlbumId Title ArtistId
1 For Those About To Rock We Salute You 1
2 Balls to the Wall 2
3 Restless and Wild 2
4 Let There Be Rock 1
5 Big Ones 3
In [51]:
%sql SELECT * from Artist LIMIT 5;
Done.
Out[51]:
ArtistId Name
1 AC/DC
2 Accept
3 Aerosmith
4 Alanis Morissette
5 Alice In Chains
In [52]:
%%sql
SELECT Artist.Name, Album.Title
FROM Album
INNER JOIN Artist on Album.ArtistId = Artist.ArtistId
ORDER BY Artist.Name ASC
LIMIT 5;
Done.
Out[52]:
Name Title
AC/DC For Those About To Rock We Salute You
AC/DC Let There Be Rock
Aaron Copland & London Symphony Orchestra A Copland Celebration, Vol. I
Aaron Goldberg Worlds
Academy of St. Martin in the Fields & Sir Neville Marriner The World of Classical Favourites

You can assign results of queries to Python names

In [53]:
result = %sql SELECT * from Album;
Done.
In [54]:
type(result)
Out[54]:
sql.run.ResultSet
In [55]:
result[2:4]
Out[55]:
[(3, 'Restless and Wild', 2), (4, 'Let There Be Rock', 1)]

You can use Python variables in your queries.

Use :varname where you want to use a Python variable in your query.

In [56]:
artist_id = 10

%sql select * from Artist where ArtistId < :artist_id;
Done.
Out[56]:
ArtistId Name
1 AC/DC
2 Accept
3 Aerosmith
4 Alanis Morissette
5 Alice In Chains
6 Antônio Carlos Jobim
7 Apocalyptica
8 Audioslave
9 BackBeat
In [57]:
word = '%rock%'

%sql select * from Album WHERE Title LIKE :word;
Done.
Out[57]:
AlbumId Title ArtistId
1 For Those About To Rock We Salute You 1
4 Let There Be Rock 1
59 Deep Purple In Rock 58
108 Rock In Rio [CD1] 90
109 Rock In Rio [CD2] 90
213 Pure Cult: The Best Of The Cult (For Rockers, Ravers, Lovers & Sinners) [UK] 139
216 Hot Rocks, 1964-1971 (Disc 1) 142
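
The same named-placeholder style also works outside the magic: the sqlite3 DB-API accepts :varname placeholders with a dictionary of parameters. A sketch on an in-memory table (the rows are made up to mimic the Chinook Artist table):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Artist (ArtistId INTEGER, Name TEXT)")
conn.executemany("INSERT INTO Artist VALUES (:id, :name)",
                 [{"id": 1, "name": "AC/DC"}, {"id": 10, "name": "Billy Cobham"}])

# :artist_id is filled from the keys of the parameter dictionary.
rows = conn.execute("SELECT Name FROM Artist WHERE ArtistId < :artist_id",
                    {"artist_id": 10}).fetchall()
print(rows)  # [('AC/DC',)]
```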

Convert to pandas dataframe

In [58]:
df = result.DataFrame()
df.head(5)
Out[58]:
AlbumId Title ArtistId
0 1 For Those About To Rock We Salute You 1
1 2 Balls to the Wall 2
2 3 Restless and Wild 2
3 4 Let There Be Rock 1
4 5 Big Ones 3